Introduction

blablabla

Data description and preprocessing

The datasets used in the analysis below were downloaded from Kaggle (www.kaggle.com) [1]. They were compiled from several sources, including the Bureau of Justice Statistics [2] and the FBI Uniform Crime Reporting (UCR) Program [3]. The National Prisoner Statistics Program, conducted by the Bureau of Justice Statistics, has collected data on the number of prisoners in state and federal prison facilities since 1926. The data are produced annually at the national and state level and are sourced from the 50 state departments of correction, the Federal Bureau of Prisons and, until 2001, the District of Columbia. The UCR Program provides statistics on violent and property crimes. These data are collected annually and are available at the national, state and city level. For the purposes of our analysis we use state-level statistics.

Additionally, we collected data on prison expenditures for each state from the Bureau of Justice Statistics [4]. The figures are for 2016, the latest year available. Later in the analysis we use them to correlate expenditures with the occurrence of particular crimes.

UCR

The UCR dataset consists of 15 variables, two of which identify the jurisdiction and the year of the observation. It provides the state population and the yearly number of violent crimes (murder, manslaughter, rape, robbery, aggravated assault) and property crimes (burglary, larceny, vehicle theft) per state. Detailed definitions of each crime can be found on the UCR Program website.

The crime_reporting_change variable flags cases in which a state's reporting standards changed. The crimes_estimated variable flags cases in which the FBI computed estimates for participating agencies in the state that did not provide 12 months of complete data [5].

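library(tidyverse)   # read_csv, dplyr verbs, ggplot2
library(gridExtra)   # grid.arrange
library(knitr)       # kable
library(panelr)      # long_panel
library(plotly)      # ggplotly
ucr <- read_csv("data/ucr_by_state.csv")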
colnames(ucr)
##  [1] "jurisdiction"           "year"                  
##  [3] "crime_reporting_change" "crimes_estimated"      
##  [5] "state_population"       "violent_crime_total"   
##  [7] "murder_manslaughter"    "rape_legacy"           
##  [9] "rape_revised"           "robbery"               
## [11] "agg_assault"            "property_crime_total"  
## [13] "burglary"               "larceny"               
## [15] "vehicle_theft"          "X16"                   
## [17] "X17"                    "X18"                   
## [19] "X19"                    "X20"                   
## [21] "X21"

Unlike the other datasets, which have no missing values, the ucr dataset has many. We dropped the last 6 columns (X16 to X21), which were completely empty, and then the rows consisting only of missing values. This leaves no missing values in any column except rape_revised (612 missing) and rape_legacy (104 missing).

# removing last 6 columns
ucr <- ucr[, -c(16:21)]
# removing all missing rows
ind <- apply(ucr, 1, function(x) all(is.na(x)))
ucr <- ucr[ !ind, ]
# showing sum of missing values per columns
sapply(ucr, function(x) sum(is.na(x)))
##           jurisdiction                   year crime_reporting_change 
##                      0                      0                      0 
##       crimes_estimated       state_population    violent_crime_total 
##                      0                      0                      0 
##    murder_manslaughter            rape_legacy           rape_revised 
##                      0                    104                    612 
##                robbery            agg_assault   property_crime_total 
##                      0                      0                      0 
##               burglary                larceny          vehicle_theft 
##                      0                      0                      0
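
The same cleaning step can also be written without hard-coded column positions. The following is a minimal sketch on a small made-up data frame (toy and its values are hypothetical and not taken from ucr_by_state.csv), assuming dplyr 1.0.4 or later for where() and if_all():

```r
library(dplyr)

# hypothetical toy data: X16 is an entirely empty column, row 3 is an entirely empty row
toy <- tibble(
  jurisdiction = c("Alabama", "Alaska", NA),
  year         = c(2001, 2001, NA),
  rape_revised = c(NA, 10, NA),
  X16          = c(NA, NA, NA)
)

toy_clean <- toy %>%
  select(where(~ !all(is.na(.x)))) %>%   # drop columns that contain only NA
  filter(!if_all(everything(), is.na))   # drop rows that contain only NA

# remaining missing values per column
colSums(is.na(toy_clean))
```

Selecting by content rather than by index keeps the step valid if the number of empty trailing columns in the CSV ever changes.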

As the plot on the left below shows, the last two years, 2016 and 2017, contain one additional observation, i.e. one additional jurisdiction. The plot on the right, which shows only the jurisdictions with fewer than 17 observations, reveals that New York is missing in one year and that Puerto Rico appears in only 3 years. District of Columbia is sometimes recorded as DC, but the two spellings together cover all 17 years.

library(viridis)
## Loading required package: viridisLite
plot.data1 = ucr %>% group_by(year) %>% count()
ggp1 = ggplot(data = plot.data1, aes(x=year, y=n, fill=year)) + 
  geom_bar(stat = "identity") +
  scale_fill_viridis() +
  theme_minimal() + 
  theme(axis.title.x = element_blank(), 
        axis.title.y = element_blank(),
        legend.position = "none")

plot.data2 = ucr %>% group_by(jurisdiction) %>% count() %>% arrange(n) %>% filter(n<17)
ggp2 = ggplot(data = plot.data2, aes(x=jurisdiction, y=n, fill=jurisdiction)) + 
  geom_bar(stat = "identity") +
  theme_minimal() + 
  scale_fill_viridis_d() +
  theme(axis.title.x = element_blank(), 
        axis.title.y = element_blank(),
        legend.position = "none")

grid.arrange(ggp1, ggp2, ncol = 2)

Based on the above, we decided to rename “DC” to “District of Columbia” and to exclude Puerto Rico, which is not a state.

ucr$jurisdiction[ucr$jurisdiction=="DC"] <- "District of Columbia"
ucr <- ucr %>% filter(jurisdiction!="Puerto Rico")

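We also analysed the missing values of rape_revised and rape_legacy. As the table below shows, rape_legacy is available only up to 2015 and rape_revised only from 2013, so the two definitions overlap in just three years. Because neither variable covers the whole period and the two cannot be compared across it, we decided to drop both.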

rape_df <- data.frame(year=2001:2017)
rape_revised_count <- ucr[!is.na(ucr$rape_revised),] %>% 
                            group_by(year) %>% 
                            count(name="rape_revised_count")
rape_legacy_count <- ucr[!is.na(ucr$rape_legacy),] %>% 
                            group_by(year) %>% 
                            count(name="rape_legacy_count")
rape_df <- left_join(rape_df, rape_revised_count, by="year")
rape_df <- left_join(rape_df, rape_legacy_count, by="year")

kable(rape_df)
year  rape_revised_count  rape_legacy_count
2001                  NA                 51
2002                  NA                 51
2003                  NA                 51
2004                  NA                 51
2005                  NA                 51
2006                  NA                 51
2007                  NA                 51
2008                  NA                 51
2009                  NA                 51
2010                  NA                 51
2011                  NA                 51
2012                  NA                 51
2013                  51                 51
2014                  51                 51
2015                  50                 50
2016                  51                 NA
2017                  51                 NA

ucr$rape_legacy <- NULL
ucr$rape_revised <- NULL
colnames(ucr)
##  [1] "jurisdiction"           "year"                  
##  [3] "crime_reporting_change" "crimes_estimated"      
##  [5] "state_population"       "violent_crime_total"   
##  [7] "murder_manslaughter"    "robbery"               
##  [9] "agg_assault"            "property_crime_total"  
## [11] "burglary"               "larceny"               
## [13] "vehicle_theft"

# one density plot per crime-count column (columns 6 to 13; column 5 is state_population)
crime_cols <- colnames(ucr)[6:13]
colors <- viridis(length(crime_cols))
pl <- lapply(seq_along(crime_cols), function(ii) {
  ggplot(ucr, aes(x = .data[[crime_cols[ii]]])) +
    geom_density(alpha = 0.3, fill = colors[ii], color = colors[ii]) +
    theme_minimal() +
    theme(axis.title.x = element_blank(),
          axis.title.y = element_blank()) +
    labs(title = crime_cols[ii])
})

grid.arrange(grobs=pl)

(…)

Incarcerations in prison

prison <- read_csv("data/prison_custody_by_state.csv")
head(prison)
## # A tibble: 6 x 18
##   jurisdiction includes_jails `2001` `2002` `2003` `2004` `2005` `2006`
##   <chr>                 <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Federal                   0 149852 158216 168144 177600 186364 190844
## 2 Alabama                   0  24741  25100  27614  25635  24315  24103
## 3 Alaska                    1   4570   4351   4472   4534   4798   5052
## 4 Arizona                   0  27710  29359  31084  32384  33345  35752
## 5 Arkansas                  0  11489  11849  12068  12577  12455  12854
## 6 California                0 157142 159695 161785 163939 168035 172298
## # ... with 10 more variables: `2007` <dbl>, `2008` <dbl>, `2009` <dbl>,
## #   `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>,
## #   `2015` <dbl>, `2016` <dbl>

Unlike ucr, the prison data are in wide format, with one column per year. Using long_panel we converted the data frame to long format, so that each row corresponds to one jurisdiction and one year.

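# long_panel needs a variable name next to the period label, so append "1" to each year column ("2001" becomes "20011")
colnames(prison)[3:18] <- paste0(colnames(prison)[3:18],'1')
# label_location = "beginning": the leading 2001..2016 is read as the period, the trailing "1" as the variable name
prison_panel <- long_panel(prison, begin = 2001, end = 2016, label_location = "beginning", id = "jurisdiction")
# rename the period column ("wave") and the value column ("1")
names(prison_panel)[names(prison_panel) == "wave"] <- "year"
names(prison_panel)[names(prison_panel) == "1"] <- "prison"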
kable(head(prison_panel))
jurisdiction  year  includes_jails  prison
Alabama       2001               0   24741
Alabama       2002               0   25100
Alabama       2003               0   27614
Alabama       2004               0   25635
Alabama       2005               0   24315
Alabama       2006               0   24103
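
The same wide-to-long reshaping can be sketched with tidyr, which does not require renaming the year columns first. This is an alternative to long_panel and not the approach used in this report; prison_wide is a hypothetical two-row excerpt built from the head(prison) output above, and names_transform assumes tidyr 1.1.0 or later:

```r
library(dplyr)
library(tidyr)

# hypothetical excerpt shaped like prison_custody_by_state.csv
prison_wide <- tibble(
  jurisdiction   = c("Alabama", "Alaska"),
  includes_jails = c(0, 1),
  `2001`         = c(24741, 4570),
  `2002`         = c(25100, 4351)
)

prison_long <- prison_wide %>%
  pivot_longer(
    cols = matches("^[0-9]{4}$"),              # every column whose name is a four-digit year
    names_to = "year",                         # former column names go here
    values_to = "prison",                      # former cell values go here
    names_transform = list(year = as.integer)  # store the year as a number, not as text
  )

prison_long
```

Because the original data frame is left untouched, code that later reads the year columns of the wide table still sees their original names.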

Prison expenditures

prison_exp_2016 <- read_delim("data/prison_expenditures.csv", ";")
head(prison_exp_2016)
## # A tibble: 6 x 2
##   `State and type of government` `prison expenditure`
##   <chr>                                         <dbl>
## 1 Alabama                                      722269
## 2 Alaska                                       338005
## 3 Arizona                                     1684710
## 4 Arkansas                                     595731
## 5 California                                 15468283
## 6 Colorado                                    1313103

State area and region

To enrich later visualisations, we added information about each state's area and region, taken from the us_states dataset in the spData package.

library(spData)
library(sf)
us_states_info <- data.frame(jurisdiction = us_states$NAME, 
                             region = us_states$REGION,
                             area_km2 = as.numeric(round(us_states$AREA, 0)))
us_states_info
##            jurisdiction   region area_km2
## 1               Alabama    South   133709
## 2               Arizona     West   295281
## 3              Colorado     West   269573
## 4           Connecticut Norteast    12977
## 5               Florida    South   151052
## 6               Georgia    South   152725
## 7                 Idaho     West   216513
## 8               Indiana  Midwest    93648
## 9                Kansas  Midwest   213037
## 10            Louisiana    South   122346
## 11        Massachusetts Norteast    20911
## 12            Minnesota  Midwest   218566
## 13             Missouri  Midwest   180716
## 14              Montana     West   380829
## 15               Nevada     West   286364
## 16           New Jersey Norteast    20274
## 17             New York Norteast   127202
## 18         North Dakota  Midwest   183178
## 19             Oklahoma    South   180971
## 20         Pennsylvania Norteast   117242
## 21       South Carolina    South    80904
## 22         South Dakota  Midwest   199767
## 23                Texas    South   687714
## 24              Vermont Norteast    24866
## 25        West Virginia    South    62813
## 26             Arkansas    South   137690
## 27           California     West   409747
## 28             Delaware    South     5182
## 29 District of Columbia    South      178
## 30             Illinois  Midwest   145993
## 31                 Iowa  Midwest   145744
## 32             Kentucky    South   104458
## 33                Maine Norteast    85520
## 34             Maryland    South    26849
## 35             Michigan  Midwest   151119
## 36          Mississippi    South   123745
## 37             Nebraska  Midwest   200272
## 38        New Hampshire Norteast    24026
## 39           New Mexico     West   314886
## 40       North Carolina    South   129233
## 41                 Ohio  Midwest   107051
## 42               Oregon     West   251346
## 43         Rhode Island Norteast     2743
## 44            Tennessee    South   109114
## 45                 Utah     West   219860
## 46             Virginia    South   105405
## 47           Washington     West   175436
## 48            Wisconsin  Midwest   144954
## 49              Wyoming     West   253310

Because two states, Hawaii and Alaska, are missing from us_states_info, we added their region and land area manually [6].

additional_states <- data.frame(jurisdiction = c("Hawaii", "Alaska"),
           region = c("remote", "remote"),
           area_km2 = c(16638, 1481346 ))

us_states_info <- rbind(us_states_info, additional_states)
us_states_info
##            jurisdiction   region area_km2
## 1               Alabama    South   133709
## 2               Arizona     West   295281
## 3              Colorado     West   269573
## 4           Connecticut Norteast    12977
## 5               Florida    South   151052
## 6               Georgia    South   152725
## 7                 Idaho     West   216513
## 8               Indiana  Midwest    93648
## 9                Kansas  Midwest   213037
## 10            Louisiana    South   122346
## 11        Massachusetts Norteast    20911
## 12            Minnesota  Midwest   218566
## 13             Missouri  Midwest   180716
## 14              Montana     West   380829
## 15               Nevada     West   286364
## 16           New Jersey Norteast    20274
## 17             New York Norteast   127202
## 18         North Dakota  Midwest   183178
## 19             Oklahoma    South   180971
## 20         Pennsylvania Norteast   117242
## 21       South Carolina    South    80904
## 22         South Dakota  Midwest   199767
## 23                Texas    South   687714
## 24              Vermont Norteast    24866
## 25        West Virginia    South    62813
## 26             Arkansas    South   137690
## 27           California     West   409747
## 28             Delaware    South     5182
## 29 District of Columbia    South      178
## 30             Illinois  Midwest   145993
## 31                 Iowa  Midwest   145744
## 32             Kentucky    South   104458
## 33                Maine Norteast    85520
## 34             Maryland    South    26849
## 35             Michigan  Midwest   151119
## 36          Mississippi    South   123745
## 37             Nebraska  Midwest   200272
## 38        New Hampshire Norteast    24026
## 39           New Mexico     West   314886
## 40       North Carolina    South   129233
## 41                 Ohio  Midwest   107051
## 42               Oregon     West   251346
## 43         Rhode Island Norteast     2743
## 44            Tennessee    South   109114
## 45                 Utah     West   219860
## 46             Virginia    South   105405
## 47           Washington     West   175436
## 48            Wisconsin  Midwest   144954
## 49              Wyoming     West   253310
## 50               Hawaii   remote    16638
## 51               Alaska   remote  1481346

Unifying state names

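In the prison dataset the only jurisdiction without a match in ucr is Federal, and in prison_exp_2016 it is Washington, D.C. To unify the names we renamed both to District of Columbia. Note that the Federal counts cover the entire federal prison system, so the resulting District of Columbia prison figures are not comparable with those of the states. We also renamed the variable State and type of government to jurisdiction to simplify later calculations.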

setdiff(prison$jurisdiction %>% unique(), ucr$jurisdiction %>% unique())
## [1] "Federal"
setdiff(prison_exp_2016$`State and type of government` %>% unique(), ucr$jurisdiction %>% unique())
## [1] "Washington, D.C."
setdiff(ucr$jurisdiction %>% unique(), us_states_info$jurisdiction %>% unique())
## character(0)
prison$jurisdiction[prison$jurisdiction=="Federal"] <- "District of Columbia"
names(prison_exp_2016)[names(prison_exp_2016) == "State and type of government"] <- "jurisdiction"
prison_exp_2016$jurisdiction[prison_exp_2016$jurisdiction=="Washington, D.C."] <- "District of Columbia"
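
Once the names match, the tables can be joined on jurisdiction, which the planned comparison of expenditures with crime counts requires. A minimal sketch with made-up numbers follows: ucr_2016 and exp_2016 are hypothetical stand-ins for the 2016 rows of ucr and for prison_exp_2016, and the Spearman rank correlation is our illustrative choice, not a method fixed by this analysis.

```r
library(dplyr)

# hypothetical stand-ins; the values are invented for illustration
ucr_2016 <- tibble(
  jurisdiction         = c("State A", "State B", "State C", "State D"),
  violent_crime_total  = c(100, 400, 250, 900),
  property_crime_total = c(1500, 3000, 2600, 7000)
)
exp_2016 <- tibble(
  jurisdiction       = c("State A", "State B", "State C", "State D"),
  prison_expenditure = c(50, 90, 120, 400)
)

# rows of ucr_2016 with no partner in exp_2016; should be empty after unifying names
unmatched <- anti_join(ucr_2016, exp_2016, by = "jurisdiction")

joined <- inner_join(ucr_2016, exp_2016, by = "jurisdiction")

# rank correlation of expenditure with each crime category
cors <- sapply(
  joined[c("violent_crime_total", "property_crime_total")],
  function(x) cor(x, joined$prison_expenditure, method = "spearman")
)
cors
```

Checking anti_join before the join makes a silently dropped jurisdiction visible, which setdiff on the name vectors alone does not show once several tables are involved.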

Background

According to recent surveys of United States expenditures, spending on incarceration has increased about three times as fast as spending on elementary and secondary education during this time period. (…)

  1. Number of prisoners:

Average historical violent crimes vs. property crimes per population across states

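The maps below show the average yearly number of property crimes and violent crimes per inhabitant for each state in the us_states dataset (the contiguous states and the District of Columbia). Crime counts are averaged over 2001 to 2017 and divided by the 2015 population (total_pop_15).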

ucr_grouped <- ucr %>% 
                  group_by(jurisdiction) %>% 
                  summarise(
                    violent_crime_total = mean(violent_crime_total),
                    property_crime_total = mean(property_crime_total))
names(ucr_grouped)[names(ucr_grouped) == "jurisdiction"] <- "NAME"

ucr_grouped
## # A tibble: 51 x 3
##    NAME                 violent_crime_total property_crime_total
##    <chr>                              <dbl>                <dbl>
##  1 Alabama                           20979.              169904.
##  2 Alaska                             4581.               22255.
##  3 Arizona                           29705.              255059.
##  4 Arkansas                          14376.              104902.
##  5 California                       180014.             1076578.
##  6 Colorado                          17139.              153682.
##  7 Connecticut                        9864.               81138 
##  8 Delaware                           5212.               28289.
##  9 District of Columbia               8269.               30596.
## 10 Florida                          111554.              682763.
## # ... with 41 more rows
us_states_ucr <- merge(us_states, ucr_grouped, by = "NAME")
us_states_ucr$violent_crime_per_pop <- us_states_ucr$violent_crime_total/us_states_ucr$total_pop_15
us_states_ucr$property_crime_per_pop <- us_states_ucr$property_crime_total/us_states_ucr$total_pop_15

ggp1 <- ggplot(data = us_states_ucr) +
    geom_sf(aes(fill = property_crime_per_pop)) +
    scale_fill_viridis_c(option = "viridis", trans = "sqrt")

ggp2 <- ggplot(data = us_states_ucr) +
    geom_sf(aes(fill = violent_crime_per_pop)) +
    scale_fill_viridis_c(option = "viridis", trans = "sqrt")

Property crimes per population

Violent crimes per population
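
A per-capita rate can also be computed from the state_population column of ucr itself, so that each year's count is divided by the same year's population rather than by a single reference year. The sketch below uses made-up numbers (toy_ucr is hypothetical), and the scaling to 100,000 inhabitants is our assumption, chosen because it is the usual convention for crime rates:

```r
library(dplyr)

# hypothetical toy data with the same column names as ucr
toy_ucr <- tibble(
  jurisdiction        = c("State A", "State A", "State B", "State B"),
  year                = c(2001, 2002, 2001, 2002),
  state_population    = c(1000000, 1100000, 500000, 500000),
  violent_crime_total = c(5000, 4400, 1000, 1500)
)

rates <- toy_ucr %>%
  # yearly rate per 100,000 inhabitants, using the population of that year
  mutate(violent_per_100k = violent_crime_total / state_population * 1e5) %>%
  group_by(jurisdiction) %>%
  # average of the yearly rates, one row per jurisdiction
  summarise(violent_per_100k = mean(violent_per_100k))

rates
```

Averaging yearly rates and dividing averaged counts by one population figure generally give different numbers when the population changes over the period, so the choice should be stated next to the map.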

Statistical analysis of the dataset

What is the relationship between the number of prisoners (prison) and the occurrences of particular crimes over the years (ucr)? Does an increase in the number of prisoners reduce the rate of any type of crime? Or is there a steady rise or fall in crime? (geom_line and geom_smooth)

Does this significant investment into imprisonment improve public safety? Prison expenditures versus crime occurrences, in total and by category, in 2016 (the latest data); source: https://www.bjs.gov/index.cfm?ty=dcdetail&iid=286

How does the number of prisoners change over the years? For the whole country and for individual states?

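# total number of prisoners per year, summed over all jurisdictions
prison_country <- sapply(prison[, 3:18], sum)
df <- stack(prison_country)
colnames(df) <- c("value", "year")
# the year columns were suffixed with "1" for long_panel above, so keep only the four-digit year
df$year <- as.integer(substr(as.character(df$year), 1, 4))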
p <- ggplot(data = df, aes(x = year, y = value, group = 1, 
            text = paste("Year: ", year,
                         "<br>Number of prisoners:", value))) +
  geom_line() + 
  geom_point() + 
  # scale_color_viridis() + 
  # scale_fill_viridis() +
  labs(title = "Number of prisoners in the USA by year", x = "Year", y = "Number of prisoners") +
  theme_minimal()

ggplotly(p, tooltip = "text")
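
For the per-state part of the question, the long format is more convenient, because ggplot2 can then draw one line per jurisdiction. A sketch under stated assumptions: long is a hypothetical excerpt in the shape of prison_panel, filled with the first three years of the three states shown in the head(prison) output above.

```r
library(dplyr)
library(ggplot2)

# hypothetical excerpt in long format (one row per jurisdiction and year)
long <- tibble(
  jurisdiction = rep(c("Alabama", "Alaska", "Arizona"), each = 3),
  year         = rep(2001:2003, times = 3),
  prison       = c(24741, 25100, 27614,
                   4570, 4351, 4472,
                   27710, 29359, 31084)
)

# one line per jurisdiction
p_states <- ggplot(long, aes(x = year, y = prison, color = jurisdiction)) +
  geom_line() +
  geom_point() +
  scale_color_viridis_d() +
  labs(title = "Number of prisoners by state and year",
       x = "Year", y = "Number of prisoners") +
  theme_minimal()

# yearly totals over the jurisdictions in the excerpt
totals <- long %>%
  group_by(year) %>%
  summarise(prison = sum(prison))

totals
```

With all jurisdictions on one plot the legend becomes long, so faceting by region or highlighting a few states may read better than one colour per state.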

Additional variables -> area (ok), in the code -> prison expenditures (ok), in Excel, 2016

-> What else besides the map and the bubble chart? A heatmap.

-> https://www.datanovia.com/en/blog/top-r-color-palettes-to-know-for-great-data-visualization/


  1. Source: https://www.kaggle.com/christophercorrea/prisoners-and-crime-in-united-states#ucr_by_state.csv

  2. Source: https://www.bjs.gov/index.cfm?ty=dcdetail&iid=269

  3. Source: https://www.ucrdatatool.gov/Search/Crime/State/RunCrimeStatebyState.cfm

  4. Source: https://www.bjs.gov/index.cfm?ty=dcdetail&iid=286

  5. “For agencies supplying 3 to 11 months of data, the national UCR Program estimates for the missing data by following a standard estimation procedure using the data provided by the agency. If an agency has supplied less than 3 months of data, the FBI computes estimates by using the known crime figures of similar areas within a state and assigning the same proportion of crime volumes to nonreporting agencies.” (cited from https://www.ucrdatatool.gov/faq.cfm)

  6. Sources: https://en.wikipedia.org/wiki/Alaska and https://en.wikipedia.org/wiki/Hawaii